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Abstract 

The silver carp (Hypophthalmichthys molitrix) is among tlie most intensively pond-cultured fisli species 
and is used in the wild to counteract water bloom in China. However, little genomic information is avail- 
able for this species, especially regarding its ability to grow rapidly in water, even water contaminated with 
high concentrations of poisonous microcystin. In this study, we performed de novo transcriptome assem- 
bly and analysis of the 1 7.10 million short-read sequences produced by the lllumina paired-end sequen- 
cing technology. Using an improved multiple k-mer contig assembly method coupled with further 
scaffolding, 85 759 sequences were obtained. There were 23 044 sequences annotated with 3423 gene 
ontology terms for 104 196 term occurrences and the three corresponding organizing principles. A 
total of 38 200 assembled sequences were involved in 218 predicted Kyoto Encyclopedia of Genes and 
Genomes metabolic pathways. We also recovered 41 of 44 genes involved in the biosynthesis of glutathi- 
one. Of these, five genes were identified as experienced positive selection between silver carp and zebra- 
fish, as determined by the li!<e!ihood ratio test. This report is the first annotated review of the silver carp 
transcriptome. These data will be of interest to researchers investigating the evolution and biological pro- 
cesses of the silver carp. This work also provides an archive for future studies of recent speciation and evo- 
lution of Cyprinidae fishes and can be used in comparative studies of other fishes. 
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1. Introduction 

The transcriptome is made up of the subset of 
genes active in a selected tissue and species. 
Understanding the dynamics of the transcriptome 
is essential for interpreting phenotypic variation 
caused by combinations of genotypic and environ- 
mental factors.' Massively parallel sequencing of 
RNA^ (RNA-Seq) has offered the opportunity to char- 
acterize the transcriptome with unprecedented sensi- 
tivity and depth. It has already revolutionized the way 
we study the transcriptome. The latest paired-end 
sequencing of RNA-Seq techniques have further 
improved the efficiency of DNA sequencing and 
expanded short read lengths, permitting a deeper 



understanding of the transcriptome.^ RNA-Seq is in- 
dependent of prior knowledge and does not require 
design work, thus reducing the required staff, cost 
and time and providing the unprecedented opportun- 
ity to conduct low-cost transcriptome studies at lower 
cost for non-model organisms. The RNA-Seq technol- 
ogy has been applied to many model organisms'*"^ ^ 
for the discovery of splice variants, RNA editing sites 
and new microRNAs, but fewer studies were con- 
ducted in non-model fish organisms.'^''"* 

TheActinopterygii, in terms of numbers,are the dom- 
inant class of vertebrate, comprising nearly 96% of the 
26 000 species of fish.' ^ However, the genomic informa- 
tion of this group is very rare: only six genomes' ^"'^ 
and several transcriptomes^°'^' are available. This 
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has hindered research into these valuable species. 
Cyprinidae is the largest family of freshwater fish in 
the Actinopterygii.' ^ The endemic clade of East Asian 
Cyprinidae displays a tremendous diversity of pheno- 
typic and ecological traits in this area. This clade is an 
ideal model system for the study of rapid radiations 
and evolutionary adaptation over short periods of 
time.^^ Silver carp (Hypophthalmichthys molitrix) of 
the family Cyprinidae are among the most intensively 
pond-cultured species in China. As one of famous 
four major Chinese carps, breeding production 
reached 3 million tons in China in 2009.^^ Aside 
from their great importance to the fishery economy, 
silver carp have been found useful in counteracting 
cyanobacteria blooms in China. ^"^'^^ Silver carp are 
also a good model for the study of speciation because 
of its split with bighead carp (J-iypophthalmichthys 
nobilis), which occurred only ~3 Mya.^^ However, 
lack of genomic resources like genome sequence, tran- 
scriptome sequences and molecular markers has made 
the study of silver carp breeding, the mechanism of its 
ability to counteract water bloom and evolutionary 
analysis a difficult task. 

When no genome sequence is available, transcrip- 
tome sequencing is an effective way to obtain large 
numbers of molecular makers and identify transcripts 
involved in specific biological processes. In this study, 
we present the first silver carp transcriptome using 
massively parallel mRNA sequencing. We perform 
lllumina sequencing of the heart, liver, brain, spleen 
and kidney tissues to characterize the H. molitrix tran- 
scriptome. A database (Silver Carp Base) is under con- 
struction and we expect that it will provide the first 
picture of the transcriptome of this species. The data- 
base will be updated in the future if additional data 
become available. 

2. Materials and methods 

2. / . Ettiics statement 

All experimental protocols were approved by the 
ethics committee of Institute of HydroBiology, 
Chinese Academy of Sciences. 

2.2. Organ collection and RNA isolation 

A wild silver carp was collected from the middle 
reach of the Yangtze River. To obtain the whole tran- 
scriptome, RNA from five organs (heart, liver, brain, 
spleen and kidneys) was extracted using TRIzol 
reagent (Invitrogen, Carlsbad, CA, USA). After the 
quality examination by the way of electrophoresis 
and a BioPhotometer plus 6132 (Eppendorf, 
Germany), RNAs from different organs were mixed 
together at equivalent concentrations. Total RNA 
extraction was in accordance with the manufacturer's 



protocol and it was treated with RNase-free DNase I 
(New England Biolabs) for 30 min at 37°C to 
remove the residual DNA. 

2.3. cDNA library preparation and sequencing 
Beads with oligo(dT) were used to purify poly(A) 

mRNA from total RNA. Then, the mRNA was fragmen- 
ted using a RNA fragmentation kit (Ambion). First- 
strand cDNA was synthesized using random hexamer- 
primer and reverse transcriptase (Invitrogen), and 
second-strand cDNA was synthesized next. Then the 
paired-end cDNA library was prepared in accordance 
with lllumina's protocols with an insert size of 2 00 bp 
and sequenced for 75 bp. The lllumina GA processing 
pipeline vO.2.2.6 was used to analyze the image and 
for base calling. 

2.4. De novo assembly of silver carp transcriptome 
As no optimal k-mer length is appropriate for all de 

novo transcriptome assemblies, the multiple k-mer 
method was used to obtain longer silver carp mRNA 
sequences, which are very useful in subsequent ana- 
lysis steps. Our method is based on the modified 
'additive Multi-fc' method described by Yann Surget- 
Groba^^ After removing reads with the sequencing 
adapter and reads of low quality, paired-end reads 
were subjected to de novo assembly using ABySS^^ 
with k-mer lengths of 58, 54, 52, 50, 48, 46, 44, 
42, 40, 38 and 34. The unused reads at higher k- 
mer lengths were not discarded before running the 
assembly for a lower k-mer length. The output data 
set of each k-mer length was subjected to SSPACE^^ 
for scaffolding, respectively. When pooling all the 
results together, some contigs and scaffolds appeared 
in two or more assemblies, causing redundancy. These 
were removed using CD-HIT-EST.^° The longest 
possible contigs and scaffolds were retained. At last, 
the STM+ method^^ was used to perform translation 
mapping scaffolding with the Danio rerio proteome^' 
serving as a reference. 

2.5. Sequence annotation 

The assembled sequences were blasted against 
the NCBI Nr (non-redundant) protein database and 
Swiss-prot database using BLASTX^^ and an £-value 
of 1e-5. To shorten the search time, searches were 
limited to the first 1 0 significant hits for each query. 
Gene names were assigned to each sequence 
according to its best BLAST hit (highest score). 

The Blast2GO suit^^ was used for functional anno- 
tation of assembled sequences applying the function 
for the mapping of gene ontology (GO) terms to 
sequences with BLAST hits obtained from hits with 
E-value < 1e-5, annotation cut-off > 55 and a GO 
weight > 5 were used for annotation. Assembled 
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sequences were thus assigned to primary and sub-GO 
functional categories. 

2.6. Simple sequence repeat markers discovery 

A microsatellite program (MISA)^'* (http://pgrc. 
ipl<gatersleben.de/misa/) was used to identify and 
localize microsatellite motifs. We searched for all 
types of simple sequence repeats (SSRs) from mono- 
nucleotide to hexanucleotides using the following 
parameters: at least 1 0 repeats for mono-, 6 repeats 
for di- and 5 repeats for tri-, tetra-, penta- and hexa- 
nucleotide for simple repeats. Both perfect (i.e. SSRs 
contain a single repeat motif like such as 'ATC') and 
compound (i.e. composed of two or more motifs 
separated by <1 00 bp) SSRs were identified. 

2.7. Positive selection 

All the 44 sequences involved in the biosynthesis of 
glutathione (GSH) were downloaded from Kyoto 
Encyclopedia of Genes and Genomes (KEGG) and 
used as a query to blast against the 85 759 sequences 
assembled. Only the reciprocal BLAST best-hit result 
sequence was kept to form the sequence pair with 
its corresponding query. To determine whether the 
sequence pair underwent positive selection, a likeli- 
hood ratio test (LRT) was performed for the Nsite 7 
and Nsite 8 of codemi in PAML was used.^^ 

3. Results 

3.1 . De novo assembly with multiple k-mer lengths 
and sequence validation 

We prepared the mixed cDNAs from the heart, liver, 
brain, spleen and kidneys of silver carp at equivalent 
concentration. One lane of lllumina Genome 
Analyzer was performed and ~17.10 million 75 bp 
paired-end reads were obtained. After cleaning the 
low-quality reads, we used a modified version of a pre- 
vious published procedure^^ (see Materials and 
methods) to assemble the reads for non-redundant 
consensus. The bioinformatics workflow is depicted 
in the flowchart shown in Supplementary Fig. IS. 
Short read data have been deposited in NCBI's 
Short Read Archive at http://www.ncbi.nlm.nih.gov/ 
sra under the accession SRP0081 33. 

To assemble the paired-end reads into contigs, we 
used ABySS^^ with different k-mer lengths (Table 1). 
Although paired-end information has been used in 
ABySS, a great improvement was found after scaffold- 
ing the contigs with SSPACE^^ (Table 2). After pooling 
all the scaffolds obtained from multiple k-mers 
together, 3 930 925 sequences were collected. 
Using CD-HIT-EST,^° scaffolds were assembled into 
clusters that were analyzed for consensus. Finally, 
85 796 sequences ranging from 200 to 13 880 bp 



Table 1. Summary statistics of the assemblies used to assess the 
performances of the Mulit-K de novo assembly method 



Method 


k-mer Contig > 1 DC 


1 N50 


Max 
length 


Total 

length 

(Mb) 


Average 

contig 

size 


single K 


58 


3328 


65 


3324 


2.076 


87 




54 


22 397 


1 59 


5597 


8.1 97 


127 




52 


37 207 


233 


6087 


1 2.602 


140 




50 


51 041 


239 


8297 


1 8.059 


142 




48 


61 097 


241 


8045 


23.245 


144 




46 


69 71 7 


242 


8358 


28.235 


145 




44 


77 038 


239 


1 1 062 


33.1 33 


142 




42 


82 806 


236 


1 0 004 


37.633 


1 39 




40 


87 673 


233 


7322 


41.928 


1 35 




38 


91 694 


228 


1 0 092 


45.964 


1 30 




34 


97 1 53 


220 


1 3 873 


53.206 


119 


multi K 




1 1 8 764 


257 


1 3 880 


58.075 


1 59 


These statistics correspond to the set of contig > 1 00 bp. 
k-mer, required length of identical overlap match between 
two reads by ABySS; N50, contig length-weighted median; 
max length, length of the longest contig; (Total length) 
summed length of all contig > 1 00 bp. 


Table 2. Summary statistics of the scaffolds produced by SSPACE 


k-mer 


scaffold > 1 00 N50 1 

1 


Vlax 
length 


Total length Average 
(Mb) scaffold size 


58 


2805 


65 


4835 


2.074 


92 


54 


1 9 1 84 


241 


7069 


8. 


1 65 


1 35 


52 


31314 


279 


1 1 041 


1 2.561 


1 51 


50 


41815 


301 


1 1 669 


1 8.033 


1 55 


48 


49 814 


325 


1 2 403 


23.239 


1 59 


46 


57 241 


332 


1 0 964 


28. 


259 


160 


44 


63 799 


324 


1 1 062 


33.1 96 


1 56 


42 


69 856 


314 


10950 


37.. 


804 


1 52 


40 


74 827 


302 


1 2 1 40 


42.046 


146 


38 


79 408 


291 


1 1 339 


46.097 


1 39 


34 


87 408 


270 


1 3 880 


53.. 


831 


127 



These statistics correspond to the set of scaffold > 1 00 bp. 
k-mer, required length of identical overlap match between 
two reads by ABySS; N50, scaffold length-weighted 
median; max length, length of the longest scaffold; total 
length, summed length of all scaffold > 1 00 bp. 



were collected. The length distribution of all the 
sequences is shown in Fig. 1 . 

To determine the expression level of the transcripts, 
we mapped the raw reads to the assembled sequences 
with SOAP^*^ and the RPKM value (Reads Per Kilobase 
of exon model per Million mapped reads) of all the 
transcripts are shown in Supplementary Table SI. 
Figure 2 depicts the relationship of RPKM versus 
the transcript size. Transcript length increased with 
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coverage depth and reached an asymptote approxi- 
mately at an average coverage of ~50. 

Until now, no general criteria have been proposed 
as standards for evaluation of the quality of transcrip- 
tome assembly. We used three substantial factors to 
assess how well the assembled sequences represent 
the actual transcriptome population: (i) gene 
coverage, (ii) transcript sequence quality and (iii) 
completeness. 

The transcriptome gene coverage was judged by 
comparison with the sequence information available 
for silver carp. All 1 3 mitochondrial protein-coding 
genes and 2 03 of 21 7 proteins in the NCBI database 
were present in our assembled scaffolds. We com- 
pared our assembled scaffolds with the zebrafish tran- 
scriptome (ENSEMBLZV61) and found that 40 509 of 
41 759 (85.9%) zebrafish transcripts have matches 
in assembled scaffolds. At the same time, 19 893 
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Figure 1. Length distributions of scaffolds assembled by a multiple 
k-mer method. 



reciprocal best-hit blast matches with the zebrafish 
transcriptome were identified using £-value le-5. 

Transcriptome quality was assessed by comparing 
the mitochondrial protein-coding genes found in 
assembled sequences to mitochondrion sequence in 
GenBank (NC_01 01 56). A total of 1 0 1 85 nucleotide 
identities were observed out of 1 0 522 (96.8%) total 
nucleotide length of contig to coding mitochondrial 
sequences BLAST matches, suggesting very good 
transcriptome sequence quality. The observed 3.2% 
sequence difference might be due to the high intra- 
specific genetic variability. 

Finally, in terms of sequence completeness, the rela- 
tive number of full-length sequences in the 19 893 
reciprocal best-hit blast matches to zebrafish tran- 
scriptome was estimated. A sequence was considered 
full length if it contained the complete 5'- and 3'-UTR 
of the mRNA. In this study, we used a less stringent 
but broadly adopted definition, considering a se- 
quence to be full length if it comprised at least the 
complete coding sequence (CDS).^^ We mapped 
the 1 9 893 sequences to their corresponding CDS in 
the zebrafish transcript, and if the CDS was fully 
covered by assembled sequence, we thought the 
sequence as full-length sequence. Under the criteria 
given above, 1 937 sequences (9.7%) were validated 
as full length. One thousand, six hundred and thirty- 
five sequences (8.2%) covered more than 9 5% of the 
zebrafish CDS and 2394 sequences (1 2.0%) covered 
more than 75% of the zebrafish CDS. For a pseudo- 
stop codon usually appears on a chimeric or truncated 
transcript, we translated the nucleotide sequences 
to protein sequences to verify the completeness 
of the transcripts. One thousand, three hundred 
and seventy-seven sequences were validated as full 
length and 21 09 sequences covered more than 95% 
of the zebrafish CDS. 




RPKM 

Figure 2. The relationship of RPKM versus the transcript size. RPKM, Reads Per Kilobase of exon model per Million mapped reads. 
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Figure 3. Functional classification of silver carp transcriptome and comparison with zebrafish transcriptome. (A) GO: biological process. (B) 
GO: molecular function. (C) GO: cellular component. In some cases, one transcript may have multiple functions. Grey, silver carp; black, 
zebrafish. 



In addition to the computing metfiods given above, 
Reverse transcription polymerase chain reaction (RT- 
PCR) assay was used to validate the quality of the 
assembled transcriptome. Primers for 22 transcripts 
with different expression levels (RPKM ranged from 
336 to 10 507) were designed and all the cDNAs 
were amplified. Out of the 22 pairs of primers, 10 
pairs were silver-specific transcripts which did not 
have an NCBI Nr BLAST hit (see Sequence annotation). 
The primer information and RT-PCR results are 
shown in Supplementary Table S2 and Fig. S2. 



3.2. Sequence annotation 

Several complementary methods were used to an- 
notate the assembled sequences. First, the assembled 
sequences were searched against the Nr protein data- 
bases using BLASTX with an E-value of le-5. Of the 
85 796 assembled sequences, 54 198 (63.2%) had 
significant matches (Supplementary Table S3). Most 
of the sequences with top-hit blast result from zebra- 
fish (44 999 sequences; 83.0%). In addition, 18 536 
(34.2%) sequences matched predicted proteins, 81 
(0.1%) with unknown proteins. 

Second, silvercarp sequences that had matches in Nr 
databases were given GO annotations with the Uniprot 
database. Of these, 2 3 044 were assigned to one or 
more 3423 GO terms, for a total of 1 04 1 96 term 
occurrences. As many as 1 7 451 sequences were 



found to be involved in biological process and could 
be divided into cellular process (13 382 sequences 
with percentage of 76.7%), metabolic process 
(10 846; 62.2%), biological regulation (5032; 
28.9%), multicellular organismal process (4386; 
25.1%), pigmentation (4296; 24.6%), localization 
(4162; 23.8%), developmental process (4088; 
23.4%), establishment of localization (3448; 1 9.8%), 
cellular component organization (2793; 16.0%) and 
response to stimulus (2275; 13.0%). Other type of 
functions occurred at <1 0%each (Fig. 3). 

GO analysis have also shown that 1 5 799 sequences 
were associated with a cellular component, including 
cell (15 11 4; 95.6%), cell part (15 1 1 4; 9 5.6%), organ- 
elle (8265; 52.3%), organelle part (3624; 22.9%) and 
macromolecular complex (3042; 19.2%). Moreover, 
1 7 837 sequences showed potential molecular func- 
tion, such as binding (1 3 02 7; 73%), catalytic activity 
(8473; 47.5%), molecular transducer activity (1 648; 
9.2%) and transporter activity (1406; 7.8%). The 
detailed information about the functional classifica- 
tion is shown in Supplementary Table S3. 

Representation of GO categories in the silver carp 
transcriptome set was found to be similar to that of 
the zebrafish GO database, but there were a few 
differences in each of the three main GO categories 
(Fig. 3). After correcting for multiple tests, we found 
that 30 of 3 7 comparisons were significantly over or 
underrepresented in comparison to the zebrafish 
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CDEFGHIJKLMNOPQ 

Function Class 



A: RNA processing and modification 

B: ChromaCn structure and dynamics 

C : Energy producDon and conversion 

D: Cell cycle control, cell division, chromosome partitioning 

E : Amino acid transport and metabolism 

F: Nucleotide transport and metabolism 

G: Carbohydrate transport and metabolism 

H: Coenzyme transport and metabolism 

I: Lipid transport and metabolism 

J : Translation, ribosomal structure and biogenesis 

K: Transcription 

L: Replicabon, recombinabon and repair 
M : Cell wall^mbrane/envelope biogenesis 
NiCell motility 

0: Posttranslabonal iDodificabon, protein turnover, chaperenes 

P : Inorganic Ion transport and metabolism 

Q : Secondary metabolites biosynthesis, transport and catabolism 

R : General function prediction only 

S: Function unknown 

T: Signal bansducbon mechanisms 

U : Intracellular bafficking, secrebon, and vesicular transport 

V: Defense mechanisms 

W : Extracellular sbuctures 

Y: Nuclear structure 

Z: Cytoskeleton 



Figure 4. COG annotations of putative proteins. All putative proteins were aligned to COG database and can be classified functionally into 
at least 2 5 molecular families. 



records. For example, among the biological processes, 
pigmentation (GO: 0043473) was underrepresented 
in the silver carp transcriptome, while localization 
(GO: 0051 1 79) and response to stimulus (GO: 
0050896) were overrepresented. 

Meanwhile, annotation of the 85 759 sequences 
using Clusters of Orthologous Groups of protein 
(COG) databases yielded good results for 14 840 
putative proteins. The COG-annotated putative pro- 
teins ranged functionally into at least 2 5 molecular 
families, including biochemical metabolism, signal 
transduction, cellular structure and immune defense, 
in accordance with the categories observed in GO 
annotation (Fig. 4). 

3.3. Metabolic pathways by KEGG analysis 

A total of 38 200 assembled sequences were found 
to be involved in 218 predicted KEGG metabolic 
pathways. The number of sequences ranged from 3 
to 451 0 (Supplementary Table S4). The top 20 path- 
ways with the greatest number of sequences are 
shown in Table 3, and the greatest number of tran- 
scripts was found in the metabolic pathways. The 
top 1 0 metabolic pathways were: purine metabolism 
(789), pyrimidine metabolism (473), oxidative 



phosphorylation (436), inositol phosphate metabol- 
ism (43 5), glycerophospholipid metabolism (371), 
riboflavin metabolism (347), glycolysis/gluconeogen- 
esis (341 ), lysine degradation (33 7), pyruvate metab- 
olism (239) and starch and sucrose metabolism 
(21 8) (Supplementary Table S5). 

3.4. Positive selection of genes involved in GSH 
synthesis 

Microcystins (MCs) are cyclic non-ribosomal pep- 
tides produced by cyanobacteria. They are cyanotox- 
ins and can be very toxic to fishes and other 
animals, including humans. With the increasing fre- 
quency of water bloom outbreaks in many countries, 
the task of eliminating them has become both more 
urgent and more difficult. Recently, the silver carp 
and bighead carp have been used to counteract 
cyanobacteria in many lakes in China. Despite 
the hepatotoxicity of MC, the body weights of silver 
carp increase very fast in bodies of water that are 
full of MCs.^^ The high tolerance of silver carp to 
MCs might be due to the high basic GSH level in the 
liver or an increased GSH synthesis.^^ 

To fully understand the mechanism behind the 
high tolerance of silver cap to MCs, we evaluated 
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Table 3. The top 20 pathways with highest sequence numbers 



Num 


Pathway 


All genes with 
pathway 
annotation 
(38 200) 


Pathway 
ID 


1 


Metabolic pathways 


4b 1 (J (,1 1 .O 1 ^j 


KOU 1 1 uU 


2 


Pathways in cancer 


1 790 (4.59%) 


koOSzOO 


3 


Regulation of actin 
cytoskeleton 


1 534 (4.28%) 


I<o048 1 0 


4 


Focal adhesion 


1518 (3.97%) 


l<o0451 0 


5 


MAPK signaling pathway 


1453 (3.83%) 


ko0401 0 


6 


Endocytosis 


1 345 (3.52%) 


1 r\ A A A A 

ko041 44 


7 


Tight junction 


1 256 (3.29%) 


ko04530 


8 


Adherens junction 


1073 (2.81%) 


ko04520 


9 


Phagosome 


1034 (2.71%) 


ko04145 


10 


Dilated cardiomyopathy 


1 027 (2.69%) 


l<o05414 


1 1 


Vascular smooth muscle 
contraction 


1014 (2.65%) 


l<o042 70 


1 2 


Complement and coagulation 
cascades 


1 005 (2.63%) 


l<o0461 0 


1 3 


Hypertrophic cardiomyopathy 
(HCM) 


957 (2.51 %) 


ko0541 0 


1 4 


Chemokine signaling pathway 


955 (2.5%) 


ko040b2 


1 5 


Calcium signaling pathway 


942 (2.47%) 


ko04020 


16 


Axon guidance 


939 (2.46%) 


l<o043 60 


1 7 


Insulin signaling pathway 


912 (2.39%) 


ko0491 0 


1 8 


Huntington's disease 


907 (2.37%) 


ko0501 6 


19 


Leukocyte transendothelial 
migration 


869 (2.27%) 


ko04670 


20 


Protein processing in 
endoplasmic reticulum 


864 (2.26%) 


ko04141 



whether the genes involved in glutathione synthesiz- 
ing were under positive selection in silver carp. From 
the KEGG databases, 44 genes involved in glutathione 
synthesis in zebrafish were obtained with the full CDS 
region (PATHWAY dre00480). After searching against 
the whole transcriptome sequences, we found that 
most of the zebrafish CDS had been recovered 
(Table 4). Sequence pairs were constructed by the 
zebrafish CDS, and its corresponding best-hit blast 
scaffolds of silver carp were thereafter tested for 
whether they had experienced positive selection. For 
these 44 genes, the average number of codons was 
found to be 356 (range, 1 37-966). The F3 x 4 
model of codon frequencies was used and models 
M7 and M8 were used to determine which pairs of 
sequences were under positive selection. The log-like- 
lihood value under M8 was much higher than its cor- 
responding value under M7, indicating that model 
MS is more suitable to the sequence pair compared 
with model M7. LRT shows that five sequence pairs 
were found to have P-values <0.05. They are 



thought to have experienced positive selection 
between silver carp and zebrafish (Table 5). 

3.5. Identification ofSSR or microsatellites 

Because SSRs or microsatellite markers are used for 
many animal breeding applications, the 85 796 
sequences were analyzed for identification of SSR 
markers. We obtained 1 3 327 SSR markers in 9636 
sequences with the MISA.^'^ In terms of abundance, 
mononucleotide repeats were found to be most abun- 
dant (7693, 57.3%) followed by dinucleotide repeats 
(3733, 28.0%) and trinucleotide repeats (1 538, 
1 1 .5%). Other type of repeat units occurred at <2% 
each. SSR markers were divided into two groups, 
perfect SSR markers (only one single repeat motif 
such as 'AGC') and compound SSR markers (composed 
of two or more SSR markers separated by <1 00 bp). A 
total of 1206 (9.0%) compound SSR markers were 
identified. After excluding the mono-nucleotide 
repeats, the frequency of an SSR motif was calculated. 
Among the dinucleotide repeat motifs, AC/GT was the 
most abundant, with 69.2%; trinucleotide repeat 
motifs were rich in ATC/GAT, with 2 7.8%, and tetranu- 
cleotide repeat motifs were AGAT/ATCT, with 21.1%. 

4. Discussion 

The transcriptome is the complete repertoire of 
expressed RNA transcripts in the cell and its character- 
ization is essential to understanding the functional 
complexity of the genome. Using the next-generation 
sequencing technology, we were able to sequence and 
annotate the transcriptome of silver carp. This is most 
comprehensive study of silver carp transcriptome data 
to date. The transcriptome sequences obtained by 
this study are useful to the understanding of the 
genetic makeup of the silver carp transcriptome, 
which until now has been very limited. The lllumina 
sequencing yielded 17.10 million paired-end reads 
for silver carp. The 85 769 sequences produced here 
may be useful for further research into silver carp 
functional genomics. The obtained overall GC 
content of the silver carp transcriptome was 39.2%, 
which was lower than the GC content of cDNA 
library of zebrafish (EnsembI 61).^^ However, when 
we removed the assembled sequences that contained 
gaps, the GC content rose to 45.5%, which was similar 
to that of the zebrafish cDNA library (46.2%). 

Due to the lackof a complete genome sequence, the 
quality of transcriptome analysis of non-model species 
must rely largely on the contigs and scaffolds 
assembled from the raw reads. After reassembling 
the transcriptomes of two mosquitoes with known 
genomes using a de novo assembler. Gibbons et al."^" 
found that short reads can be used to assemble 
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Table 4. Sequences recovered in the glutathione synthesizing 


; pathway 






npsrri nt inn 


Length 


f V 1 u Lv^ L 1 


dre:l 000021 45 


Gamma-glutamyltranspeptidase 


2082 


0 


dre:l 00006589 


Isocitrate dehydrogenase 1 (NADP+) 


1 290 


1 254 


dre:l 001 24622 


Glutathione S-transferase 


672 


514 


dre:l 00330864 


Ribonucleoside-diphosphate reductase subunit M2-lil<e 


1 1 61 


1057 


dre:1 00333757 


Camma-glutamyltransferase 5-like 


1 521 


1 162 


dre:l 14426 


Ornithine decarboxylase 


1 386 


1 370 


dre:30733 


Ribonucleotide reductase M2 polypeptide 


1 1 61 


1 057 


dre:30740 


Ribonucleotide reductase Ml polypeptide 


2385 


2385 


dre:322533 


Many! (membrane) aminopeptidase b 


2898 


1238 


dre:324366 


Glutathiones-transferase M 


660 


658 


dre:324900 


Protein-disulfide reductase (glutathione) 


519 


51 5 


dre:326857 


Glutamate-cysteine ligase, catalytic subunit 


1 896 


1 896 


dre:333974 


Glutamate-cysteine ligase, modifier subunit 


822 


736 


dre:352926 


Glutathione peroxidase 1 a 


576 


565 


dre:352928 


Glutathione peroxidase 4a 


561 


561 


dre:352929 


Glutathione peroxidase 4b 


576 


576 


dre:386951 


Isocitrate dehydrogenase 2 (NADP+), mitochondrial 


1 350 


1 350 


dre:394009 


Spermidine synthase 


870 


749 


dre:406278 


Gamma-glutamylcyclotransferase 


663 


600 


dre:406703 


Glutathione S-transferase, alpha-like 


672 


571 


dre:406736 


Glutathione S-transferase 


453 


425 


dre:406762 


Phosphogluconate hydrogenase 


1 536 


141 8 


dre:431 762 


Glutathione S-transferase 


459 


451 


dre:436833 


Glutathione S-transferase kappa 1 


690 


680 


dre:436894 


Glutathione S-transferase 


723 


722 


dre:449784 


Microsomal glutathiones-transferase 1 


465 


244 


dre:450084 


Glutathione synthetase 


1 428 


1 385 


dre:552981 


Glutathione peroxidase 7 


561 


0 


dre:5531 69 


Glutathiones-transferase pi 2 


627 


625 


dre:553575 


Glutathione reductase (NADPH) 


1 278 


1 257 


dre:555478 


Aminopeptidase N 


2883 


763 


dre:562854 


Leucine aminopeptidase 3 


1 554 


1 320 


dre:563972 


Glutathione S-transferase theta la 


729 


729 


dre:566746 


Gamma-glutamyltranspeptidase 


1 773 


82 


dre:567275 


Glutathione S-transferase 


423 


423 


dre:568744 


Glutathiones-transferase M3 


660 


658 


dre:56901 4 


Gamma-glutamyltranspeptidase 1 -like 


1 725 


281 


dre:570579 


Glucose-6-phosphate dehydrogenase 


1 572 


1 572 


dre:571 365 


Glutathione S-transferase 


660 


658 


dre:723997 


Microsomal glutathiones-transferase 2 


41 1 


0 


dre:79381 


Glutathiones-transferase pi 


627 


626 


dre:798788 


Glutathione peroxidase 3 


669 


529 


dre:799288 


Glutathione S-transferase 


672 


512 


dre:80872 


Spermine synthase 


1 083 


1 057 



transcriptomes of non-model organisms. Although the more and more reads, de novo assembly of transcrip- 
development of the short-read assembler^^'"^^'"^^ has tomes without known reference genome using short 
rendered research facilities capable of dealing with reads is still difficult for transcripts with highly variable 
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Table 5. Genes determined to be under positive selection 



Gene id 


Model 


Log lil<elihood 


dN/dS 


Estimates of parameters 


Sites under selection 
(P> 0.95) 


dre_322533 


M 7 (beta) 

M 8 (beta and u>) 


-4530.079761 
-451 9.934204 


0.3750 
0.7805 


p= 0.00500 and 0.00810 
Po = 0. 93767, p = 0. 04957, 
q = 0.14698 and iv^ 8.71 91 4 


1 64,1 67,256 


dre_406703 


M 7 (beta) 


-1 299.1 59488 


0.1 663 


p= 0.13888 and q= 0.68894 


No 




M 8 (beta & oo) 


-1 295.384925 


1 6.1 965 


Po=0.94248, p=9.04140, 

q = 99.00000 and 280.21 751 




dre_563972 


M 7 (beta) 


-1 399.1 09283 


0.3750 


p= 0.00542 and q = 0.0091 2 


224,226,227,233 




M 8 (beta & u) 


-1 390.99741 8 


2.5343 


po = 0. 91 793, p = 9. 32406, 

(J = 32.38965 and u'^ 28.3871 0 




dre_79381 


M 7 (beta) 


-1 1 05.063900 


0.1 428 


p= 0.03221 and q= 0.1 8627 


1 29,1 74 




M 8 (beta & oo) 


-1 1 00.372249 


14.0364 


Po = 0.97094, p= 7.50265, 

q = 99.00000 and u'^ 480.64973 




dre_799288 


M 7 (beta) 


-1334.096144 


0.1 833 


p= 0.1 2802 and q= 0.56863 


No 




M 8 (beta & oo) 


-1 328.43291 4 


1 3.0557 


po = 0. 92145, p = 9. 04080, 

q = 99.00000 and 1 65.23531 





coverage.'^^ So, a higher k-mer length will theoretically 
generate a more contiguous assembly of highly 
expressed RNAs while poorly expressed transcripts 
will be more easily obtained if a lower k-mer length 
is used."^^ Therefore, an approach forde novo assembly 
of the transcriptome using various k-mer lengths is 
highly desirable and has been proven useful. The 
final assembly statistics indicate that the multiple 
k-mer method used in this study outperforms all 
other single k-mer methods (Table 1). In the single 
k-mer assembly, the average length and N50 were 
highest when the k-mer was set to 46, which we 
found to be best in all single k-mer assemblies. 
However, the number of contigs >1 00 bp and total 
length were twice that of the best single k-mer ABySS 
assembly. This marked increase was accompanied 
by a higher N50 and average contig size, indicating a 
substantial improvement in contiguity. 

Multiple k-mer methods assembled the lllumina 
reads into contigs, but the location information in 
paired-end reads were not used at all. In this study, 
we improved upon the methods described by Yann 
et al.^^ Results proved that using SSPACE to scaffold 
the contigs produced by each k-mer could produce 
longer sequences. The statistics before and after scaf- 
folding (Tables 1 and 2) indicate that a higher average 
length of sequence can be obtained by joining the 
two contigs originating from the two ends of a DNA 
fragment. The max length and the average length of 
the sequences after scaffolding were further extended. 
To assess the quality of assembled transcriptome, we 
used both computing and experimenting assays to 
validate the transcripts generated. The RT-PCR 
results confirmed that our method is reliable for the 
recovery of both highly and poorly expressed 
transcripts. 



Both gene annotation and KEGG pathway analyses 
are useful for us to predict potential genes and their 
functions at a whole-transcriptome level. In the 
silver carp transcriptome, as discovered by this study, 
the predominant gene clusters are involved in the 
structural formation of the cell, cell part and organelle 
of a cellular component, the binding and catalytic 
activity of molecular function, metabolic process and 
cellular processes of biological processes. Similar 
results were found in Sus scrofa,'^'^ European eeP^ 
and rainbow trout. However, in Chickpea transcrip- 
tome, sequences were found to be mainly involved in 
the protein metabolism of biological process, in chlor- 
oplasts, in the transferase activity of molecular func- 
tion. This suggests remarkable difference between 
animal and plants. KEGG analysis showed that more 
than 44.5% of transcripts to be enrichment factors 
involved in 218 known metabolic or signaling path- 
ways, including cell adherence, migration, apoptosis 
and immune-related processes. The KEGG pathway 
analysis and gene annotation may be useful for 
further investigation of gene function in future. 
Although there are differences between our silver 
carp transcriptome and available database for zebra- 
fish in GO annotations, concordance in the overall 
patterns suggests that our library were widely 
sampled and provided a good representation. 

One previous study reported that GSH can conju- 
gate with MC on its sulfhydryl, which is the first step 
in the detoxification of a cyanobacterial toxin in 
aquatic organisms.'^^ The glutathione S-transferase 
(GST) gene plays important roles both in the biosyn- 
thesis of GST and catalysis of the reaction between 
GSH and MC. M8 assumes 1 1 site classes: 1 0 classes 
for the beta distribution and 1 class for the positively 
selected site. Therefore, it is suitable to detect positive 
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selection in sequence pairs. Although the value of diV/ 
dS observed in this model might not be precise, the 
LRT is most likely reliable. Positive selection pressure 
on GST gene might be the result of the adaptation 
of silver carp to the eutrophied bodies of \Nater in 
the middle and lower reaches of the Yangtze River. 

Genetic markers are of great importance to the 
understanding genetic variation and to the identifica- 
tion of genes and quantitative trait locus for traits of 
interested in molecular breeding applications. Until 
now, only a small number of genetic markers have 
been available for silver carp.'^^''^^ One of the main 
reasons for this is the lack of genome sequence infor- 
mation. Alternatively, transcriptomes have been used 
for the discovery of genetic markers.'^'^''^^'^° 
Although markers developed from transcriptomes 
are less polymorphic, they have been found to be 
very useful in trait mapping^^ and comparative gen- 
omics studies.^^ 

It has been reported that SSRs comprise 3% of the 
human genome, with the largest proportion of them 
being dinucleotide repeats (0.5%).^^ In this study, 
28% of the 1 3 327 silver carp SSRs were found to 
be dinucleotide repeats, followed by trinucleotide 
repeats (11.5%). The most common dinucleotide 
repeats were AC and AG, in contrast to those found in 
the human genome (AC and AT). The same difference 
was also found in trinucleotide repeats, with ATC and 
AGG being most common in silver carp and AAT and 
AAC being most common in human. 

In conclusion, we have determined the transcrip- 
tome of silver carp through use of high-throughput 
lllumina paired-end sequencing. Our study obtained 
85 759 scaffolds and demonstrated some important 
features of silver carp transcriptome, such as gene an- 
notation and KEGG pathway analysis, as shown by 
cross-transcriptome analysis. In addition, we identi- 
fied reliable genetic markers for 1 3 324 SSRs. We 
also found that five genes identified as under positive 
selection between silver carp and zebrafish. This study 
will be helpful for improvement of the understanding 
of the recent speciation and adaption of Cyprinidae 
and provides useful resources and markers for future 
functional genomic research. 

Supplementary Data: Supplementary Data are 
available at www.dnaresearch.oxfordjournals.org. 
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